From: Bill Dally [billd@csl.stanford.edu]
Sent: Thursday, October 11, 2001 8:56 AM
To: Bill Mark; Bill Dally
Cc: Ian Buck; lance@leland.Stanford.EDU; ujk@leland.Stanford.EDU;
Billmark@lambert.stanford.edu; pmattson@leland.Stanford.EDU;
serebrin@stanford.edu; merez@stanford.edu; jowens@graphics.stanford.edu;
horowitz@chroma.stanford.edu; hanrahan@graphics.stanford.edu
Subject: Re: Brook - multiple input and output streams

Bill,

Three comments.

First, by the very one-dimensional nature of a stream, the use of 
persistant state is one dimensional - early elements affect later elements 
but not vice-versa.  Thus, the computation on persistant state can be 
converted to a log-depth parallel prefix computation - most of which takes 
place locally in one processing "element" - and only the top levels of the 
tree are between elements.

Key to managing this type of communication - which we need anyway for 
reductions - like the "max" in Ian's example - is setting the proper 
traversal order.  The traversal order wants to be in order of least 
expensive communication/synchronization.  Roughly speaking the order is - 
across "clusters" in a clustered processor (MIMD or SIMD), across time in 
one processor, across "nearby" processors (e.g., on a chip or board), 
across distant processors.  This order means that very little of the log 
tree structure crosses the expensive links.

Whether a processor is SIMD or MIMD is immaterial to the cost of 
communication.  What really matters is how tightly synchornized they are - 
not that they are all doing the same instruction.  Processors that are 
synchronized can exchange some data directly - without buffering or 
waiting.  There are lots of good technologies to synchronize MIMD 
processors (read our M-Machine papers) - SIMD processors are inherently 
synchronized.

----Bill

At 11:46 PM 10/10/2001 -0700, Bill Mark wrote:
>On Wed, 10 Oct 2001, Bill Dally wrote:
>
> >...
> > The conditional and unconditional << and >> operators in KernelC are pretty
> > useful.  You should consider adopting this model in Brook - where the
> > kernel is 'active' for the duration of the stream - rather than a single
> > element - and can explicitly read from inputs and write to outputs.
> > ...
>
>"Persistent kernels" are powerful, but allowing them does present
>some challenges, especially from a programming-model point of view.
>
>What happens when you have more than one processing unit operating on
>the stream?  There are duplicate copies of the persistent state,
>possibly with different values.  On Imagine, you have to add explicit
>communication between the processors to do something reasonable in
>this situation.  My impression is that this communication code is
>often designed for a particular number of processors.  It's not clear
>to me that there's any simple way to hide the number of processors.
>
>If the language provided the "illusion" of one processor, the compiler
>might be able to insert the appropriate communication, but I suspect
>that it would then be very easy for the programmer to create code that
>would run very slowly when the number of stream processors is high.
>We should look at this question in more detail; my instinct may be
>wrong here.
>
>Allowing communication between processing units also seems to require
>that the processors be SIMD.  SIMD's ability to simply perform
>communication of this type is perhaps its greatest advantage over
>MIMD.
>
>Bill M.
>
>
>

--------------------------------------------------------------------------
Bill Dally                billd@csl.stanford.edu             (650)725-8945
Professor of Electrical Engineering and Computer Science  FAX(650)725-6949
Computer Systems Laboratory, Stanford University
Gates Room 301
Stanford, CA  94305                         http://csl.stanford.edu/~billd
--------------------------------------------------------------------------